Day 1: Getting Comfortable with Uncertainty
05 Sep 2021
In the interdisciplinary field of analytics, a good understanding of statistical theory is important for drawing insights from data. Many modern analytics problems involve the application of advanced statistical methods and cannot be performed without the use of software. Therefore, this course presents the theory behind various statistical methods in combination with the programming skills required to implement them on data. Topics to be discussed include:
There are no formal prerequisites, but a basic working knowledge of the fundamental concepts and methods of statistics is presumed. However, because each student brings a different level of statistics expertise to this course, I'll infuse the course with reviews of the introductory statistics concepts and methods most relevant to statistical modeling and data-based analytics.
There are no formal course co-requisites, but the content presented in this course will overlap in spirit and content with several other courses in the program: Data Mining; Descriptive, Predictive and Prescriptive Analytics; Data Visualization; and Data Management.
Reinforce fundamental introductory statistical concepts and methods:
The text for this course is An Introduction to Statistical Learning (with Applications in R) by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. E-textbook and hard copies of the text can be purchased from Amazon; however, the authors have made a PDF version of the text available, and I've provided a copy of it here for you to download. In addition, the authors have provided slides, videos and an R package to accompany the text. You'll note that to complete the course, I've assigned specific chapters from the text to read each week. If you would like to use the slides and videos, that's fine, but they should be an additional resource to complement reading the text - not a substitute for it.
Additionally, the following resources may be helpful:
Throughout the course we will use the R Project for Statistical Computing and the RStudio Integrated Development Environment. A number of R packages will be used throughout the course; students will be able to download and install them on the fly.
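As a minimal sketch of the install-then-load pattern, here is how a package is set up in R. I use the MASS package (which ships with R) for illustration; the same pattern applies to the ISLR package that accompanies the text.

```r
# Install a package once if it isn't already available,
# then attach it at the start of each session
if (!requireNamespace("MASS", quietly = TRUE)) {
  install.packages("MASS")
}
library(MASS)
```

`install.packages()` downloads the package from CRAN; `library()` attaches it so its functions can be called directly.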
Friday 15 October (a.m.)
Friday 15 October (p.m.)
Saturday 16 October (a.m.)
Saturday 16 October (p.m.)
You should be familiar with, or have experience in, one or more of the following domain areas within analytics
With few exceptions, when these terms are used to discuss algorithms, what is really meant is the broader field of machine learning
Despite being a broad field, most machine learning algorithms fit into three main categories
Classes of Machine learning algorithms with applications (reference: https://medium.com/@sanchittanwar75/introduction-to-machine-learning-and-deep-learning-bd25b792e488)
Supervised learning algorithms
Unsupervised learning algorithms
Example applications of regression algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are numeric and continuous
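As a minimal sketch of regression on a continuous output (using R's built-in mtcars data, which is not part of the course materials), a simple linear regression looks like:

```r
# Model a continuous output (miles per gallon) as a function
# of a numeric input (vehicle weight, in 1000s of lbs)
fit <- lm(mpg ~ wt, data = mtcars)

# The fitted intercept and slope describe the input-output relationship
coef(fit)
```

Heavier cars get fewer miles per gallon, so the fitted slope on `wt` is negative.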
Classification is focused on predicting/labeling a discrete output (categories)
There can be more than two (yes/no) categories
Example applications of classification algorithms
The following algorithms can be used to model (describe) the relationship between inputs and outputs when the outputs are discrete or categorical
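For a concrete (illustrative) classification sketch, again using R's built-in mtcars data, logistic regression predicts the probability of a discrete, two-category output:

```r
# Model a discrete output (transmission: 0 = automatic, 1 = manual)
# from a numeric input (vehicle weight) with logistic regression
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Predicted probabilities of the "manual" category, one per car
head(predict(fit, type = "response"))
```

Each prediction is a probability between 0 and 1; a cutoff (commonly 0.5) converts it into a category label.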
Before moving forward, I want to make sure that you have a solid (yet high-level) understanding of how supervised learning algorithms work under the hood
All supervised learning algorithms have the following elements
Suppose you’re asked to create a model to describe the relationship between one set of inputs and one set of outputs
We'll call the set of inputs x and the set of outputs y; the relationship between x and y is shown in the figure below
Plot of some ideal data
Observing the figure, there doesn’t appear to be any uncertainty in the data as each point falls on a straight line
An obvious choice for a model to describe this data would be a function of the form \(y = mx +b\) - the familiar equation for a line
Adding a plot of the line \(y(x) = mx+b\) shows that a “perfect” model exists for our ideal data
Fitting the ideal data with a “perfect” model
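To make this reproducible, here is a sketch of how such ideal data could be generated and plotted. The slope and intercept (5 and 3) are the values read off the plot later in this section; the grid of x values is an assumption for illustration.

```r
# Ideal (noise-free) data: every point falls exactly on the line y = 5x + 3
df <- data.frame(x = seq(0, 10, by = 0.5))
df$y <- 5 * df$x + 3

# Plot the data and overlay the "perfect" model y(x) = mx + b
plot(df$x, df$y, pch = 19, xlab = "x", ylab = "y")
abline(a = 3, b = 5, col = "red")
```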
Now that we’ve chosen a functional form for \(f_{\text{imperfect}}\) we turn our attention to the parameters of the model
Every model comes with parameters - the values assigned to these parameters affect how well \(f_{\text{imperfect}}\) represents the relationship between x and y
For our chosen \(f_{\text{imperfect}}\) we see that there are two parameters: the slope (\(m\)) and the intercept (\(b\))
The question, of course, is: what are the values of the slope \(m\) and the intercept \(b\) that best represent the relationship between x and y
For this ideal data set, we can determine these values using our knowledge of straight lines and our ability to read a plot - the value of the intercept can be read directly from the plot as \(b = 3\)
Using this value for \(b\), we can choose any of the data points and solve for the slope, giving \(m = 5\)
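For instance, taking a hypothetical data point \((x, y) = (1, 8)\) from the plot, rearranging \(y = mx + b\) gives the slope directly:

```r
# Solve m = (y - b) / x for a single (hypothetical) data point on the line
x1 <- 1
y1 <- 8
b  <- 3
m  <- (y1 - b) / x1
m   # 5
```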
I know what you’re thinking: What does this have to do with machine learning?
In machine learning, the machine "learns" the relationship between x and y - here, constrained to the form of \(f_{\text{imperfect}}\) that we chose above - by minimizing a loss function
Loss functions are a key component in all supervised learning algorithms
In many cases, you don't have to choose the loss function used to find the optimal parameter values
Rather, it comes as part of a package deal with the modeling approach you choose (i.e. linear regression, logistic regression, etc.).
You can, however, come up with your own loss function - so long as it produces meaningful results
For our perfect example data, we can choose among several different loss functions
However, not all of them will return an accurate solution
In the sections below I walk through the choice of several different loss functions and plot the results
In this case we use a naive loss function that represents the difference between the observed output and the output returned by the proposed model
The loss function is expressed as shown below
\[ Loss_{_{naive}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N y_i-m\times x_i-b. \]
Using this function, the loss would simply be defined as the sum of the vertical distances between each observed output \(y_i, i = 1,...,N\) and the output returned by the chosen model
The parameters \(m\) and \(b\) for the best-fit line correspond to the model that has the minimum loss.
For our “ideal” data, the points fall on a straight line and we would expect the loss value in this case to be zero
Thus far, we’ve chosen a functional form that we believe is a good representation of the data – and a corresponding loss function
In the chunk below we define our naive loss function
loss_naive <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  return(sum(y - m * x - b))
}
We then use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_naive, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 2.684172e+55 4.996335e+54
$value
[1] -1.111779e+57
$counts
function gradient
501 NA
$convergence
[1] 1
$message
NULL
Looking at these results, it's clear that something isn't right - why? In the naive loss, positive and negative residuals cancel, so the loss can be driven toward \(-\infty\) by making the parameters arbitrarily large; optim() gives up without converging (convergence code 1). One fix is to sum the absolute values of the residuals instead
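A quick self-contained check (assuming the ideal data \(y = 5x + 3\) used in this section) shows the naive loss decreasing without bound as the slope grows, so there is no minimum for the optimizer to find:

```r
# Ideal data and the naive (signed-residual) loss
x <- seq(0, 10, by = 0.5)
y <- 5 * x + 3
loss_naive <- function(params, x, y) sum(y - params[1] * x - params[2])

# Larger slopes keep making the loss smaller - it never bottoms out
sapply(c(5, 50, 500), function(m) loss_naive(c(m, 3), x, y))
# 0  -4725  -51975
```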
\[ Loss_{_{absolute}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big\vert y_i-m\times x_i-b\Big\vert. \]
# First define a function to optimize
loss_absolute <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  return(sum(abs(y - m * x - b)))
}
We again use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_absolute, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 5 3
$value
[1] 3.301596e-06
$counts
function gradient
121 NA
$convergence
[1] 0
$message
NULL
The problem with the naive loss was that a sum of signed residuals is unbounded - it has no minimum
A better option would be to propose a loss function that is convex, such as
\[ Loss_{_{convex}}(\mathbf{y},\mathbf{x},m,b) = \sum_{i=1}^N \Big( y_i-m\times x_i-b\Big)^2. \]
Note that we are minimizing the squared distances between the observed values and the proposed model - hence this approach is called least squares
Finally, let's define our convex loss function
loss_convex <- function(params, x, y) {
  if (length(params) != 2) stop("Params should be a length 2 vector")
  m <- params[1]
  b <- params[2]
  return(sum((y - m * x - b) ^ 2))
}
We once more use the stats::optim() function to find the values of \(m\) and \(b\) that minimize the loss function and result in a model that best fits the data
optim(par = c(1,1), # provide starting values for m and b
      fn = loss_convex, # define function to optimize
      x = df$x, # provide values for known parameters
      y = df$y, # provide values for known parameters
      control = list(fnscale = 1))
$par
[1] 4.999800 3.000373
$value
[1] 2.501828e-06
$counts
function gradient
71 NA
$convergence
[1] 0
$message
NULL
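As a closing check (a sketch, assuming the same ideal data \(y = 5x + 3\)): the convex, squared-error loss minimized numerically by optim() is exactly the criterion that lm() solves in closed form, so both should recover \(m = 5\) and \(b = 3\).

```r
# Ideal data, as used throughout this section
df <- data.frame(x = seq(0, 10, by = 0.5))
df$y <- 5 * df$x + 3

# lm() minimizes the same sum-of-squares loss analytically
fit <- lm(y ~ x, data = df)
coef(fit)   # (Intercept) = 3, x = 5
```

Agreement between the two (up to optim()'s numerical tolerance) is a useful sanity check that the loss function and optimizer are doing what we intend.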